Introduction to Linear Regression

Md Zulquar Nain

Linear Regression

  • The linear regression method is one of the most widely used methods for examining the linear relationship between the dependent and independent variable(s).
  • There is one dependent variable, usually represented by \(Y\).
  • There may be one or more independent variables, usually represented by \(X\)s.

The problem

  • We want to examine:
  • Does life expectancy depend on the income level?
  • To measure the income level, we will use GDP per capita.
  • Data obtained from https://data.worldbank.org/country/india

Importing the Data File

# importing data from `csv` file
datar <- read.csv("sdata.csv")
  • datar - the name of the imported data frame in R

  • sdata.csv - the name of the csv file being imported

Exploring the Dataset I

  • Class, structure and dimension of the dataset
# Structure of the data
str(datar)
'data.frame':   62 obs. of  3 variables:
 $ Year : int  1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
 $ LE   : num  45.2 45.4 45.7 45.9 46.2 ...
 $ GDPPC: num  165 168 168 175 183 ...
# Class of the data
class(datar)
[1] "data.frame"
# Dimension of the data
dim(datar)
[1] 62  3

Exploring the Dataset II

  • First n rows of observations of the data set:
    • head(dataframe, n)
  • Last n rows of observations of the data set:
    • tail(dataframe, n)
# View top two rows of the data
head(datar,2)
  Year     LE    GDPPC
1 1960 45.218 165.2733
2 1961 45.398 167.5203
# View bottom two rows 
tail(datar,2)
   Year    LE     GDPPC
61 2020 70.15  980.1808
62 2021 67.24 1060.4024

The Formula

  • The mathematical formula of the linear regression can be written as follows:

\[y = \beta_0 + \beta_1*x + u\]

  • We say \(y\) depends on \(x\), and read this as: \(y\) is equal to \(\beta_1\) times \(x\), plus a constant \(\beta_0\), plus an error term \(u\).

  • When you have multiple independent variables, the equation can be written as \(y = \beta_0 + \beta_1\times x_1 + \beta_2\times x_2 + ... + \beta_n\times x_n\), where:

  • \(\beta_0\) is the intercept,

  • \(\beta_1, \beta_2, \cdots,\beta_n\) are the regression or slope coefficients associated with the predictors \(x_1, x_2, \cdots, x_n\).

  • \(u\) is the error term, the part of \(y\) that cannot be explained by the regression model.
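
For the problem at hand, with life expectancy (LE) as the dependent variable and GDP per capita (GDPPC) as the predictor, the simple model takes the form:

\[LE = \beta_0 + \beta_1 \times GDPPC + u\]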

Visualization of Data

  • Before estimating a simple linear regression model, visualize the data to gain an understanding of the relationship.
  • Make use of a scatterplot (a sketch follows below).
  • The resulting scatterplot shows that there is a positive relationship between life expectancy and the income level.
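
A minimal sketch of such a scatterplot in base R, using the datar data frame imported earlier (the axis labels and title are illustrative choices):

# Scatterplot of life expectancy against GDP per capita
plot(datar$GDPPC, datar$LE,
     xlab = "GDP per capita",
     ylab = "Life expectancy (years)",
     main = "Life expectancy vs. income level")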

Simple Linear Regression Model: The estimation

  • Linear regression in R can be estimated using the lm function.
  • The lm command takes the variables in the format:
  • lm([target/dependent var] ~ [predictor/independent var], data = [data source])
  • To know more, use help(lm).

Estimation

rlm <- lm(formula = LE ~ GDPPC,
          data = datar)
summary(rlm)

Call:
lm(formula = LE ~ GDPPC, data = datar)

Residuals:
   Min     1Q Median     3Q    Max 
-8.627 -3.498  1.082  3.559  4.791 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 47.26149    0.93112   50.76   <2e-16 ***
GDPPC        0.02698    0.00193   13.98   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.02 on 60 degrees of freedom
Multiple R-squared:  0.7651,    Adjusted R-squared:  0.7612 
F-statistic: 195.4 on 1 and 60 DF,  p-value: < 2.2e-16

The Output

  • The summary output shows six components, including:

  • Call: Shows the function call used to compute the regression model.

  • Residuals: Provides a quick view of the distribution of the residuals, which by definition have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.

  • Coefficients: Shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked with stars.

  • Residual standard error (RSE), R-squared \((R^2)\), and the F-statistic are metrics used to check how well the model fits our data.
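
As a small sketch of how these components can be accessed programmatically from the fitted rlm object:

# List the components stored in the summary object
names(summary(rlm))
# Access individual pieces, e.g. the coefficient table and the RSE
summary(rlm)$coefficients
summary(rlm)$sigma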

The output: Interpretation

  • The first step in interpreting the simple/multiple regression analysis is to examine the F-statistic and the associated p-value at the bottom of the model summary.

  • In our example, it can be seen that the p-value of the F-statistic is less than 2.2e-16, which is highly significant. This means that the predictor variable is significantly related to the outcome variable (a sketch of extracting this p-value follows below).

  • Next comes the significance of the individual coefficients.
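
A minimal sketch, assuming the rlm object fitted above: summary() stores the F-statistic itself but not its p-value, which can be recomputed from the F distribution.

# F-statistic: value, numerator df, denominator df
f <- summary(rlm)$fstatistic
# p-value of the overall F test
pf(f[1], f[2], f[3], lower.tail = FALSE)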

The output: Interpretation

summary(rlm)$coeff
               Estimate  Std. Error  t value     Pr(>|t|)
(Intercept) 47.26148728 0.931115969 50.75790 5.373627e-51
GDPPC        0.02697652 0.001929822 13.97876 1.565652e-20
  • For a given predictor, the t-statistic evaluates whether or not there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.

  • It can be seen that changes in the income level are significantly associated with changes in life expectancy in India.

  • For a given predictor variable, the coefficient \(\beta\) can be interpreted as the average effect on \(y\) of a one-unit increase in the predictor \(x\).

  • In our example, as income (GDP per capita) increases by 100 units, life expectancy increases by about 2.7 years (100 × 0.027); a quick check of this arithmetic follows below.
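
A one-line check of this arithmetic from the fitted model, using the coef() accessor:

# Effect on life expectancy of a 100-unit increase in GDP per capita
coef(rlm)["GDPPC"] * 100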

Model accuracy

  • The next step is to check how good the model is, that is, how well it explains the data.

  • The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary (see the sketch after this list):

  • Residual Standard Error (RSE),

  • R-squared \((R^2)\) and \(adjusted~R^2\),

  • F-statistic
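
A minimal sketch of extracting these three quantities from the model summary (assuming the rlm object fitted earlier):

# Residual standard error, R-squared, adjusted R-squared, F-statistic
summary(rlm)$sigma
summary(rlm)$r.squared
summary(rlm)$adj.r.squared
summary(rlm)$fstatistic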

Model accuracy

  • Residual Standard Error (RSE)
  • The RSE (or model sigma), corresponding to the prediction error, represents roughly the average difference between the observed outcome values and the values predicted by the model.
  • The lower the RSE, the better the model fits our data.
  • Dividing the RSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible (a sketch follows below).
  • In this example, the RSE = 4.02, meaning that the observed values deviate from the predicted values by approximately 4.02 units on average.
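
A sketch of the prediction error rate for this model, assuming LE in datar is the outcome variable:

# Prediction error rate: RSE divided by the mean of the outcome
sigma(rlm) / mean(datar$LE)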

Model accuracy

  • R-squared \((R^2)\) and adjusted \(R^2\)
  • R-squared \((R^2)\) ranges from 0 to 1 and represents the proportion of variation in the outcome variable that can be explained by the model's predictor variables.
  • \(R^2\) measures how well the model fits the data: the higher the \(R^2\), the better the model.
  • However, a problem with \(R^2\) is that it will always increase when more variables are added to the model, even if those variables are only weakly associated with the outcome.
  • A solution is to adjust \(R^2\) by taking into account the number of predictor variables.
  • The adjusted R-squared is therefore the better measure.
  • An (adjusted) \(R^2\) that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model.
  • In this example, the adjusted \(R^2\) is 0.7612, which is good. A sketch of computing \(R^2\) by hand follows below.
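
As an illustrative sketch (not shown in the slides), \(R^2\) can also be computed by hand from the residual and total sums of squares:

# R-squared by hand: 1 - RSS/TSS
rss <- sum(residuals(rlm)^2)
tss <- sum((datar$LE - mean(datar$LE))^2)
1 - rss / tss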

THANKS